feat: add docker agent serve chat command (OpenAI-compatible API) #2510
Open · dgageot wants to merge 19 commits into docker:main
Conversation
dgageot force-pushed from 40d4ebc to 8711dac
trungutt previously approved these changes on Apr 26, 2026
dgageot added a commit to dgageot/cagent that referenced this pull request on Apr 27, 2026
dgageot force-pushed from 8711dac to 594ad2b
dgageot (Member, Author) commented:
Update — expanded scope

I pushed a force-update onto this branch with 18 additional commits on top of the original. New commits (oldest → newest), grouped:

- Example
- Hardening (trivial / opt-in, safe defaults)
- Auth & deployment
- Performance
- Protocol surface
- Bug fixes (found by review pass)
- CI

Breaking changes: none on the wire (the API only gains new optional features). For programmatic Go callers of …

Happy to split this into separate PRs if reviewers prefer; commit messages are written to be cherry-pickable.

Assisted-By: docker-agent
Expose any docker-agent agent through an OpenAI-compatible HTTP
server, so tools that already speak the Chat Completions protocol
(Open WebUI, the official `openai` SDKs, ad-hoc curl scripts, etc.)
can drive an agent without any custom integration.
Endpoints:
- GET /v1/models — lists exposed agents as OpenAI models
- POST /v1/chat/completions — runs the agent; supports stream: true (Server-Sent Events) and false
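For illustration, assuming the default listen address and an agent exposed under the name `root` (the agent name here is hypothetical), the two endpoints can be smoke-tested with curl:

```sh
# List exposed agents as OpenAI models
curl http://127.0.0.1:8083/v1/models

# Non-streaming completion
curl http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "root", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming completion over Server-Sent Events
curl -N http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "root", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```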
The team is loaded once at startup and shared across requests; each
chat completion gets a fresh session and runtime. Tool calls and
elicitation prompts are auto-handled (this is a non-interactive
endpoint). The `model` field can pin a specific agent in a multi-agent
team; if it doesn't match an exposed agent, it is ignored and the
team's default agent runs.
Implementation notes:
- New `cmd/root/chat.go` cobra command (default 127.0.0.1:8083,
--agent / --listen flags) wired into `cmd/root/serve.go`.
- New `pkg/chatserver` package, split into:
- server.go — Run, router, HTTP handlers, sseStream, errors
- agent.go — agentPolicy, buildSession, runAgentLoop, sessionUsage
- types.go — request/response shapes
- Reuses `openai.Model` from github.com/openai/openai-go/v3 for
/v1/models. Other OpenAI SDK response types serialise too noisily
with stdlib `encoding/json` (the SDK relies on its internal
`apijson` package which we can't import), so request/response
shapes are hand-rolled for clean output.
- Defensive event handling in runAgentLoop: ToolsApproved=true and
NonInteractive=true mean the runtime never blocks for confirmation
in normal flow, but ElicitationRequestEvent must still be answered
or the runtime would hang on its dedicated channel.
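A minimal sketch of that defensive loop; the event names follow the description above, but the surrounding types and the decline call are assumptions about the runtime API, not its real signatures:

```go
// Sketch only: drain every event, stream text, and never block on the
// elicitation channel of this non-interactive endpoint.
for event := range events {
    switch ev := event.(type) {
    case *runtime.AgentMessageEvent: // hypothetical content event
        emit.onContent(ev.Content)
    case *runtime.ElicitationRequestEvent:
        // Must be answered even with NonInteractive=true, otherwise the
        // runtime hangs waiting on its dedicated channel: decline it.
        ev.Decline() // hypothetical helper
    case *runtime.ErrorEvent:
        errs = append(errs, ev.Err) // surfaced after the stream drains
    }
}
```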
Tests cover session-building, agent-policy, error-envelope shape,
and the three early-validation paths of /v1/chat/completions via
httptest. Validated with `mise lint` (0 issues), `mise test` (all
packages green), and a curl smoke test against examples/42.yaml.
Fixes docker#2502
Assisted-By: docker-agent
Demonstrates the OpenAI-compatible HTTP server introduced in PR docker#2510. Uses the official github.com/openai/openai-go SDK pointed at the local chat server's /v1 base URL and runs an interactive REPL with streaming, history retention, and graceful Ctrl-C shutdown.

Run `docker agent serve chat ./agent.yaml` in one terminal, then `go run ./examples/chat` in another.

Assisted-By: docker-agent
The chat server used to set `Access-Control-Allow-Origin: *` on every response, which makes it unsafe to expose on anything other than loopback. Replace the wildcard with an explicit per-server allow-list of one origin, and disable the CORS middleware entirely when the flag is empty.

- Introduce `chatserver.Options` so future improvements can extend the server configuration without breaking the `Run` signature on each change.
- Add a `--cors-origin` flag to `docker agent serve chat`. Default empty = no CORS headers emitted.
- Update tests; fix three pre-existing `noctx` lint failures in handlers_test.go that surfaced when the PR was rebased onto current main.

Assisted-By: docker-agent
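For example, allowing a single browser front-end (the origin value is illustrative):

```sh
docker agent serve chat ./agent.yaml --cors-origin https://app.example.com
```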
Hostile or buggy clients could previously stream gigabytes into the chat completions endpoint, or hold a goroutine open indefinitely on a slow upstream model. Cap both via Echo middleware:

- `BodyLimit` defaults to 1 MiB (configurable via `--max-request-size`). Oversized bodies now return 413 instead of being silently buffered.
- A new `requestTimeoutMiddleware` wraps `c.Request().Context()` in `context.WithTimeout` so model + tool calls + SSE streaming all share a single deadline. Default 5 minutes, configurable via `--request-timeout`.

Both limits are exposed on `chatserver.Options` (`MaxRequestBytes`, `RequestTimeout`); zero values fall back to package defaults.

Tests cover oversized body rejection and deadline propagation through the middleware chain.

Assisted-By: docker-agent
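A minimal sketch of the timeout middleware under those assumptions (Echo v4; the function name comes from the commit, the body is illustrative):

```go
import (
    "context"
    "time"

    "github.com/labstack/echo/v4"
)

// requestTimeoutMiddleware gives the model call, tool calls, and SSE
// streaming one shared deadline by deriving the request context.
func requestTimeoutMiddleware(timeout time.Duration) echo.MiddlewareFunc {
    return func(next echo.HandlerFunc) echo.HandlerFunc {
        return func(c echo.Context) error {
            ctx, cancel := context.WithTimeout(c.Request().Context(), timeout)
            defer cancel()
            c.SetRequest(c.Request().WithContext(ctx))
            return next(c)
        }
    }
}
```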
Previously runAgentLoop would record only the first ErrorEvent and drop every subsequent one on the floor while still draining the stream. That made debugging a multi-error run frustrating: only the earliest symptom was ever surfaced, even though later events often held the actual root cause (a model timeout followed by a tool call that couldn't connect, for instance).

Switch to a slice of errors and join them with `errors.Join` at the end. The handler's behaviour for callers is unchanged when a single error occurs; multi-error runs now surface a wrapped error whose `Unwrap() []error` makes each cause inspectable.

Assisted-By: docker-agent
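Go's `errors.Join` produces exactly the shape described: one error whose `Unwrap() []error` exposes each collected cause. A self-contained illustration:

```go
package main

import (
    "errors"
    "fmt"
)

func main() {
    // Two ErrorEvents collected while draining the stream.
    collected := []error{
        errors.New("model timeout"),
        errors.New("tool call: connection refused"),
    }

    err := errors.Join(collected...) // nil when the slice is empty
    fmt.Println(err)                 // both messages, newline-separated

    // Each cause stays individually inspectable.
    if joined, ok := err.(interface{ Unwrap() []error }); ok {
        for _, cause := range joined.Unwrap() {
            fmt.Println("cause:", cause)
        }
    }
}
```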
Until now a runtime error mid-stream was injected into the assistant
content as `[error: ...]` and the stream still closed with
`finish_reason: "stop"`. Clients matching on the OpenAI protocol had
no programmatic way to tell a successful completion apart from a
failed one.
Switch to OpenAI's actual on-the-wire shape: emit a separate
`data: {"error": {...}}` envelope, then terminate the stream with
`finish_reason: "error"` before the `[DONE]` sentinel. Successful
runs continue to terminate with `finish_reason: "stop"`.
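On the wire, a failed streaming run would now end like this (payloads abbreviated and illustrative):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Let me check"},"finish_reason":null}]}

data: {"error":{"message":"model timeout","type":"server_error"}}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"error"}]}

data: [DONE]
```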
Add a unit test on the new `sseStream.sendError` covering the wire
format.
Assisted-By: docker-agent
OpenAI clients regularly send `temperature`, `top_p`, `max_tokens`, and `stop` on every chat completion request. The server used to drop them silently because the request struct didn't declare them, so typos and out-of-range values went unnoticed until the upstream provider eventually returned an opaque error several seconds later.

- Add `Temperature`, `TopP`, `MaxTokens`, `Stop` to `ChatCompletionRequest` so the OpenAPI schema matches what the wire protocol allows.
- `Stop` is JSON-flexible: clients send either a single string or an array, and OpenAI accepts both. A custom `UnmarshalJSON` handles the union shape.
- `validateSamplingParams` range-checks the new fields and rejects bad input with a 400 invalid_request_error, matching how OpenAI itself behaves.

Plumbing these values through the runtime to the model layer requires per-request overrides that don't exist today; that work is tracked separately. Validating up front is the user-visible win and unblocks future plumbing.

Assisted-By: docker-agent
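A sketch of how the string-or-array union can be decoded with stdlib `encoding/json`; the field comes from the commit, the implementation is an assumption:

```go
import (
    "encoding/json"
    "fmt"
)

// Stop accepts either "stop": "\n" or "stop": ["\n", "END"], as OpenAI does.
type Stop []string

func (s *Stop) UnmarshalJSON(data []byte) error {
    // Try the single-string shape first.
    var single string
    if err := json.Unmarshal(data, &single); err == nil {
        *s = Stop{single}
        return nil
    }
    // Fall back to the array shape.
    var many []string
    if err := json.Unmarshal(data, &many); err != nil {
        return fmt.Errorf("stop must be a string or an array of strings: %w", err)
    }
    *s = Stop(many)
    return nil
}
```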
The chat server is unauthenticated by default, which is fine on loopback but unsafe anywhere else. Add an opt-in static bearer-token gate so the server can be safely bound to a LAN interface.

- `chatserver.Options.APIKey`: when non-empty, every request to /v1/* must carry `Authorization: Bearer <token>` or it is rejected with 401. Empty preserves the previous unauthenticated behaviour.
- `bearerAuthMiddleware` uses `subtle.ConstantTimeCompare` to dodge timing-side-channel leaks. CORS preflight (OPTIONS) is exempted so browsers can negotiate before sending the auth header.
- `--api-key` and `--api-key-env` flags expose the option from the CLI; the env-var form keeps secrets out of process listings.

Tests cover missing/wrong/correct tokens and the OPTIONS exemption.

Assisted-By: docker-agent
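A sketch of the gate with the constant-time comparison (the middleware name and the OPTIONS exemption come from the commit; the body is illustrative):

```go
import (
    "crypto/subtle"
    "net/http"

    "github.com/labstack/echo/v4"
)

func bearerAuthMiddleware(token string) echo.MiddlewareFunc {
    want := []byte("Bearer " + token)
    return func(next echo.HandlerFunc) echo.HandlerFunc {
        return func(c echo.Context) error {
            // CORS preflight carries no Authorization header by design.
            if c.Request().Method == http.MethodOptions {
                return next(c)
            }
            got := []byte(c.Request().Header.Get("Authorization"))
            // Constant-time compare avoids leaking token bytes via timing.
            if subtle.ConstantTimeCompare(got, want) != 1 {
                return echo.NewHTTPError(http.StatusUnauthorized, "invalid or missing bearer token")
            }
            return next(c)
        }
    }
}
```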
Until now the server was strictly stateless: every chat completion request rebuilt a fresh session from the messages array, so clients paid the tokenization cost of replaying the full history on every turn. That works but is wasteful for long conversations. Add an opt-in conversation cache:

- `chatserver.Options.ConversationsMaxSessions` enables an in-memory LRU keyed by the `X-Conversation-Id` request header. `Options.ConversationTTL` (default 30 min) bounds idle lifetime; expired entries are evicted lazily on access and on Put.
- When a request carries a known id, the server reuses the existing session and only appends the latest user message from the request body; the session already has the prior turns. When the id is unknown (or the header is absent), the server falls back to the previous behaviour and builds a session from scratch.
- New `--conversations-max` and `--conversation-ttl` CLI flags expose the feature. Default 0 keeps the old stateless behaviour.

The cache implementation is a simple map + mutex with O(n) LRU scan; that's appropriate for the small caches typical for this feature, and avoids pulling in a new dependency.

Tests cover Put/Get, TTL expiry, LRU eviction, Delete, and the new appendLatestUser helper.

Assisted-By: docker-agent
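Client-side, continuing a cached conversation looks like this (the id and agent name are illustrative):

```sh
# First turn: the server builds a session and caches it under the id
curl http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Conversation-Id: conv-7f3a2c" \
  -d '{"model": "root", "messages": [{"role": "user", "content": "Hi"}]}'

# Later turn: same id, so only the newest user message is appended
curl http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Conversation-Id: conv-7f3a2c" \
  -d '{"model": "root", "messages": [{"role": "user", "content": "And then?"}]}'
```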
Every chat completion request used to call `runtime.New` from scratch: that resolves the agent's tools, builds per-agent hook executors, and allocates per-runtime resume/elicitation channels. On a busy server those allocations show up in profiles. Add an opt-in pool so a small number of warm runtimes per agent can be reused across requests:

- `chatserver.Options.MaxIdleRuntimes` (default 4 via `--max-idle-runtimes`) bounds the idle pool size per agent. 0 disables pooling entirely and restores the original "fresh runtime per request" behaviour.
- `runtimePool.Get` returns a recycled runtime when one is idle, or creates a new one. `Put` returns it to the pool on completion; overflow is dropped on the floor (the team owns the toolsets, so nothing leaks).
- A runtime is *not* safe for concurrent `RunStream` calls (its resume/elicitation channels are per-runtime), so the pool hands out at most one borrow per runtime at a time. Concurrency comes from holding multiple runtimes per agent.

Assisted-By: docker-agent
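A hypothetical sketch of the pool; only the Get/Put semantics above come from the commit, and the `runtime.Runtime` type name is assumed:

```go
import "sync"

type runtimePool struct {
    mu      sync.Mutex
    maxIdle int
    idle    map[string][]*runtime.Runtime // idle runtimes, keyed by agent
    newFn   func(agent string) (*runtime.Runtime, error)
}

// Get hands out an idle runtime, or builds a fresh one. Each runtime has
// at most one borrower at a time; concurrency comes from multiple runtimes.
func (p *runtimePool) Get(agent string) (*runtime.Runtime, error) {
    p.mu.Lock()
    if rts := p.idle[agent]; len(rts) > 0 {
        rt := rts[len(rts)-1]
        p.idle[agent] = rts[:len(rts)-1]
        p.mu.Unlock()
        return rt, nil
    }
    p.mu.Unlock()
    return p.newFn(agent)
}

// Put recycles a runtime; overflow past maxIdle is simply dropped, since
// the team owns the toolsets and nothing leaks.
func (p *runtimePool) Put(agent string, rt *runtime.Runtime) {
    p.mu.Lock()
    defer p.mu.Unlock()
    if len(p.idle[agent]) < p.maxIdle {
        p.idle[agent] = append(p.idle[agent], rt)
    }
}
```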
The previous commit only accepted a single literal origin. Real deployments often need to allow several front-ends or all subdomains of a known SaaS. Extend the flag's grammar:

- comma-separated entries form an explicit allow-list, each matched exactly;
- entries prefixed with `~` are compiled as Go regex and matched against the request's `Origin` header at request time;
- the literal `*` wildcard is preserved for the (rare) cases where the operator really wants it;
- literal entries are validated up front: scheme must be http/https, no path/query/fragment, no missing host. Mistakes are caught at startup rather than producing silent allow-none behaviour at runtime.

When the spec parses cleanly to nothing usable, the middleware is left unregistered and a slog.Error documents the misconfiguration.

Tests cover the parser's accept/reject set and exercise allow-list + regex routing through the real Echo middleware.

Assisted-By: docker-agent
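For example, combining an exact origin with a regex entry for all subdomains (values illustrative):

```sh
docker agent serve chat ./agent.yaml \
  --cors-origin 'https://app.example.com,~^https://[a-z0-9-]+\.example\.com$'
```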
When the agent invokes a tool, clients had no way to see what
happened: tools ran inside the runtime, the assistant's eventual
text output sometimes referenced them but often didn't, and the
streaming protocol carried only the model's plain content. That's
fine for a black-box transcript but useless for a chat UI that
wants to render "🔧 calling search(query=…)" badges.
Use OpenAI's standard `tool_calls` shape on both response styles:
- Add `ToolCallReference` (mirrors OpenAI's tool_call entry) with
`index`, `id`, `type`, `function.{name,arguments}`.
- `ChatCompletionMessage.ToolCalls` populated on the non-streaming
response so the assistant message lists every tool the agent
invoked.
- `ChatCompletionStreamDelta.ToolCalls` carries one tool per delta
  in streaming mode; see the example chunk after this list. The
  runtime hands us complete arguments, so one chunk per call is
  sufficient (vs. OpenAI's incremental argument streaming, which
  clients accumulate either way).
- `runAgentLoop` now takes an `agentEmit` struct with
`onContent` and `onToolCall` hooks instead of a single content
callback. Both handlers fill in their respective hooks; missing
ones are no-ops.
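For illustration, a streaming delta carrying one complete tool call could look like this (ids and arguments invented):

```json
{
  "object": "chat.completion.chunk",
  "choices": [{
    "index": 0,
    "delta": {
      "tool_calls": [{
        "index": 0,
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "search", "arguments": "{\"query\": \"docker agents\"}"}
      }]
    },
    "finish_reason": null
  }]
}
```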
Tools still execute server-side; this commit is purely about
client observability. Surfacing results back through the protocol
(so clients could intercept / replay them) is left for a future
change.
Assisted-By: docker-agent
dgageot force-pushed from 594ad2b to 7f16975
Add a static OpenAPI 3.1 document describing /v1/models, /v1/chat/completions, the new tool_calls fields, the X-Conversation-Id header, and the bearer-auth security scheme.

- The spec is hand-written and embedded with `//go:embed`. That keeps it easy to review (it's plain JSON, not generated noise), trivial to update when the API changes, and free of generation steps in the build.
- A new `GET /openapi.json` route serves the spec verbatim.
- `bearerAuthMiddleware` exempts /openapi.json so introspection tooling can discover the API even on locked-down deployments — there's no secret in the spec, only the shape of the API.

Tests cover both the document shape (correct paths advertised) and the auth bypass.

Assisted-By: docker-agent
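A sketch of the embed-and-serve wiring (Echo v4; the route is from the commit, the function name is illustrative):

```go
import (
    _ "embed"
    "net/http"

    "github.com/labstack/echo/v4"
)

//go:embed openapi.json
var openapiSpec []byte

// The spec is served verbatim and stays outside the bearer-auth gate:
// it describes the API's shape and contains no secrets.
func registerOpenAPI(e *echo.Echo) {
    e.GET("/openapi.json", func(c echo.Context) error {
        return c.Blob(http.StatusOK, "application/json", openapiSpec)
    })
}
```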
OpenAI's chat protocol lets the `content` field of a message be
either a string or an array of typed parts:
"content": [
{"type": "text", "text": "What is in this picture?"},
{"type": "image_url", "image_url": {"url": "..."}}
]
The chat server used to drop the parts variant on the floor: the
field was typed as `string`, so multi-part requests deserialised
to an empty content and the request was rejected as having "no
user message". That made the server unable to serve any
vision-capable agent.
- Replace the plain `Content string` with a JSON-union
  (un)marshaller; a sketch of the decoding follows this list.
  `Content` still carries a flat-text view for string-form content
  and for the concatenated text of parts; a new `Parts []ContentPart`
  field holds the typed entries when the array shape is used.
  Existing Go callers (and every test that still writes
  `Content: "..."`) keep working unchanged.
- `convertParts` translates the wire shape to the runtime's
`chat.MessagePart` union (text + image_url), so the model
provider sees the actual image. Unknown part types are dropped
gracefully so future part kinds degrade rather than 500.
- `appendLatestUser` (used by X-Conversation-Id continuation) gets
the same multi-part path.
- The OpenAPI spec advertises the union shape and the new
ContentPart schema.
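A sketch of the union decoding described above; `Content`, `Parts`, and `ContentPart` are named in the commit, the implementation details are assumptions:

```go
import (
    "encoding/json"
    "fmt"
    "strings"
)

type ContentPart struct {
    Type     string `json:"type"` // "text" or "image_url"
    Text     string `json:"text,omitempty"`
    ImageURL *struct {
        URL string `json:"url"`
    } `json:"image_url,omitempty"`
}

type messageContent struct {
    Content string        // flat-text view for both wire shapes
    Parts   []ContentPart // set only when the array shape was sent
}

func (m *messageContent) UnmarshalJSON(data []byte) error {
    var s string
    if err := json.Unmarshal(data, &s); err == nil {
        m.Content = s // plain string shape
        return nil
    }
    var parts []ContentPart
    if err := json.Unmarshal(data, &parts); err != nil {
        return fmt.Errorf("content must be a string or an array of parts: %w", err)
    }
    m.Parts = parts
    var texts []string
    for _, p := range parts {
        if p.Type == "text" { // unknown part types degrade gracefully
            texts = append(texts, p.Text)
        }
    }
    m.Content = strings.Join(texts, "\n")
    return nil
}
```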
Tests cover string/array round-trips, image_url plumbing into the
session, and (still passing) all the pre-existing behaviour.
Assisted-By: docker-agent
When a conversation is evicted from the LRU cache while a request is processing it, the updated session was not being stored back, because maybeStoreConversation only called Put when isNew=true. This caused conversation state to be lost when:

1. Request R1 retrieves conversation C from the cache (isNew=false)
2. R1 processes the request, updating the session
3. Meanwhile, C is evicted due to LRU policy
4. R1 finishes and calls maybeStoreConversation(C, sess, false)
5. Since isNew=false, Put was not called
6. The updated session is lost

Fix: always call Put, regardless of the isNew flag. This ensures the updated session is stored and refreshes the lastUsed timestamp, preventing premature eviction of active conversations. The Put operation is idempotent and safe to call multiple times for the same conversation ID.

Assisted-By: docker-agent
The isNew flag was used to decide whether to call Put on the conversation store, but after the previous fix we always call Put regardless of whether the conversation is new or existing. This commit removes the now-unused isNew parameter from resolveSession and maybeStoreConversation, simplifying the code.

Assisted-By: docker-agent
Add a test that verifies a conversation evicted from the LRU cache while a request is processing it can still be stored back after the request completes. This validates the fix in commit 9563a43, which ensures maybeStoreConversation always calls Put, preventing loss of session state when a conversation is evicted mid-request.

Assisted-By: docker-agent
The previous fix accidentally deleted the doc-comment header line
on `(*server).chatCompletion`, leaving a dangling fragment
("// non-streaming OpenAI ChatCompletion object.") detached from
the function it documents.
Assisted-By: docker-agent
Concurrent requests with the same X-Conversation-Id share the same `*session.Session` pointer (the conversation cache hands out the same instance to every caller), so two simultaneous runtime RunStream calls would interleave message appends, send overlapping prompts to the model, and produce a garbled transcript. Although `session.Session` has internal mutex protection on Messages, the agent loop reads-then-writes (decide what to send, append model output), so per-field synchronisation isn't enough: the whole turn must be atomic with respect to other turns on the same id.

Reject the second concurrent request with 409 Conflict instead of trying to serialise it on the server. That:

- surfaces the misuse to the caller immediately (vs. mysterious interleaving),
- keeps server-side resources bounded (no queue, no parked goroutines),
- matches how OpenAI's own conversation API expects clients to use the protocol (one request at a time per conversation).

Empty conversation id and nil lock-set are no-ops, so callers without the feature enabled keep their old behaviour. The OpenAPI spec advertises the new 409 response.

Tests cover acquire/release semantics, nil/empty no-ops, and a race-detector-friendly stress test that proves at most one holder of the same id at a time.

Assisted-By: docker-agent
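A hypothetical sketch of the lock-set with the no-op semantics described (the 409 behaviour is the commit's; the code is illustrative):

```go
import "sync"

type convLocks struct {
    mu   sync.Mutex
    held map[string]struct{}
}

// TryAcquire reports whether the caller may run a turn for this id.
// Nil receiver and empty id are no-ops that always succeed.
func (l *convLocks) TryAcquire(id string) bool {
    if l == nil || id == "" {
        return true
    }
    l.mu.Lock()
    defer l.mu.Unlock()
    if _, busy := l.held[id]; busy {
        return false // second concurrent request: answer 409 Conflict
    }
    l.held[id] = struct{}{}
    return true
}

func (l *convLocks) Release(id string) {
    if l == nil || id == "" {
        return
    }
    l.mu.Lock()
    defer l.mu.Unlock()
    delete(l.held, id)
}
```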
dgageot force-pushed from 7f16975 to 0e1c5a8
Fixes #2502.
Exposes any docker-agent agent through an OpenAI-compatible HTTP server, so any tool that already speaks the Chat Completions protocol (Open WebUI, the official `openai` SDKs, ad-hoc curl scripts, etc.) can drive an agent without a custom integration.

Endpoints

- GET /v1/models
- POST /v1/chat/completions — supports stream: true (SSE) and false

Usage
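A minimal way to exercise the server, following the run instructions from the example commit above:

```sh
# Terminal 1: expose the agent over the OpenAI-compatible API
docker agent serve chat ./agent.yaml

# Terminal 2: drive it with the SDK-based REPL example
go run ./examples/chat
```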
Design

- Tool calls run with `ToolsApproved=true` and `NonInteractive=true` — there is no human in the loop. `ElicitationRequestEvent` is still explicitly declined to avoid hanging on the runtime's elicitation channel.
- The `model` field of the request can pin a specific agent in a multi-agent team. If it doesn't match an exposed agent (e.g. clients that hard-code `gpt-4`), we silently fall back to the default agent and echo the requested model name back, so clients matching on the model field stay happy.
- Streaming uses the `chat.completion.chunk` format and ends with `data: [DONE]`.

Implementation
- New `cmd/root/chat.go` (default `127.0.0.1:8083`, `--agent` / `--listen` flags) wired into `cmd/root/serve.go`.
- New `pkg/chatserver` package, split across:
  - `server.go` — `Run`, router, HTTP handlers, `sseStream`, error envelope
  - `agent.go` — `agentPolicy`, `buildSession`, `runAgentLoop`, `sessionUsage`
  - `types.go` — request/response shapes
- Reuses `openai.Model` from `github.com/openai/openai-go/v3` for `/v1/models`. Other SDK response types serialise too noisily with stdlib `encoding/json` (the SDK relies on its internal `apijson` package, which lives under `internal/`), so the chat-completion shapes are hand-rolled for clean output.

Tests
- `httptest` for `/v1/models` shape, the three early-validation paths of `/v1/chat/completions` (bad JSON, empty messages, history without user), and `writeError`'s status→type mapping.

Validation
- `mise lint` — 0 issues
- `mise test` — all packages green
- curl smoke test against `examples/42.yaml`: `/v1/models` returns the agent, error paths return correct OpenAI-shaped envelopes.